Statistical Models for Prediction, Chap.4
4.2 Binary Outcomes
2024-12-19
Binary outcomes
- model: logistic regression
\[
\mathrm{logit}(p(y = 1)) = a + \sum_i b_i \cdot x_i
\]
- estimation: maximum likelihood (ML), penalized ML
- interpretation: each coefficient \(b_i\) is the log odds ratio for a 1-unit difference in \(x_i\)
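A minimal sketch of the univariate case, assuming a small toy dataset: the model \(\mathrm{logit}(p) = a + b \cdot x\) is fitted by gradient ascent on the log-likelihood, and \(e^b\) is then the odds ratio for a 1-unit difference in \(x\). The data and learning-rate settings here are illustrative, not from the text.

```python
import math

# Toy data (assumed): outcome y becomes more likely as x increases.
x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
y = [0,   0,   0,   1,   0,   1,   1,   1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Gradient ascent on the (concave) log-likelihood of logit(p) = a + b*x.
a, b = 0.0, 0.0
lr = 0.1
for _ in range(5000):
    grad_a = sum(yi - sigmoid(a + b * xi) for xi, yi in zip(x, y))
    grad_b = sum((yi - sigmoid(a + b * xi)) * xi for xi, yi in zip(x, y))
    a += lr * grad_a
    b += lr * grad_b

# exp(b) is the odds ratio for a 1-unit difference in x.
odds_ratio = math.exp(b)
print(round(b, 2), round(odds_ratio, 2))
```

In practice a statistical package (e.g. `statsmodels` or `glm` in R) would be used; the gradient loop is only to make the ML estimation step concrete.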
\(R^2\) in logistic regression
better models show a wider spread in predicted probabilities
Fig. 4.4
\(R^2\) on log-likelihood scale
\[
LL = \sum_i \left[ y_i \cdot \log(p_i) + (1 - y_i) \cdot \log(1 - p_i) \right]
\]
- perfect model: \(LL = 0\)
- usually: \(LL < 0\) and deviance: \(-2LL > 0\)
- comparing with null model = likelihood ratio:
\[
LL_0 = \sum_i \left[ y_i \cdot \log(\mathrm{mean}(y)) + (1 - y_i) \cdot \log(1 - \mathrm{mean}(y)) \right]
\]
\[
LR = -2 \cdot (LL_0 - LL_1)
\]
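A minimal sketch with assumed toy values: the log-likelihood of the fitted model, the null model that predicts \(\mathrm{mean}(y)\) for everyone, and the likelihood ratio statistic. The outcomes and predicted probabilities below are illustrative.

```python
import math

y = [0, 0, 1, 1, 1]
p = [0.2, 0.3, 0.6, 0.7, 0.8]   # model's predicted probabilities (assumed)

# Fitted model log-likelihood
LL1 = sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
          for yi, pi in zip(y, p))

# Null model: predict mean(y) for every subject
p0 = sum(y) / len(y)
LL0 = sum(yi * math.log(p0) + (1 - yi) * math.log(1 - p0) for yi in y)

# Likelihood ratio statistic
LR = -2 * (LL0 - LL1)
print(round(LL1, 3), round(LL0, 3), round(LR, 3))
```

Both log-likelihoods are negative; the fitted model's is closer to 0, so the LR statistic is positive.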
\(R^2\) on log-likelihood scale
\[
R^2 = \frac{1 - e^{-LR/n}}{1 - e^{2 \cdot LL_0 / n}}
\]
(Nagelkerke's \(R^2\); \(n\) = number of subjects)
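A minimal sketch of Nagelkerke's \(R^2\): the Cox-Snell \(R^2 = 1 - e^{-LR/n}\) divided by its maximum attainable value \(1 - e^{2 \cdot LL_0/n}\). The log-likelihood values below are assumed toy numbers, not results from the text.

```python
import math

n = 5
LL1 = -1.6705     # fitted model log-likelihood (assumed)
LL0 = -3.3651     # null model log-likelihood (assumed)
LR = -2 * (LL0 - LL1)

# Cox-Snell R^2, then rescaled to a 0..1 range (Nagelkerke)
r2_coxsnell = 1 - math.exp(-LR / n)
r2_max = 1 - math.exp(2 * LL0 / n)
r2_nagelkerke = r2_coxsnell / r2_max
print(round(r2_nagelkerke, 3))
```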
Fig. 4.6
Bayes rule
- prior probability of disease: \(p(D)\)
- posterior probability of disease: \(p(D|x)\)
- diagnostic likelihood ratio for symptom \(x\):
\[
LR(x) = \frac{p(x \mid D)}{p(x \mid \bar{D})}
\]
\[
\mathrm{Odds}(D \mid x) = \frac{p(D)}{1 - p(D)} \cdot LR(x) \\
\mathrm{logit}(p(D \mid x)) = \mathrm{logit}(p(D)) + \log(LR(x))
\]
- similar to univariate logistic model!
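A minimal sketch of the update above, with an assumed prior probability and likelihood ratio: multiplying the prior odds by \(LR(x)\) gives the posterior odds, and the same update on the logit scale is additive.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

p_prior = 0.10      # prior probability of disease (assumed)
LR_x = 4.0          # diagnostic likelihood ratio of symptom x (assumed)

# Odds form: posterior odds = prior odds * LR(x)
odds_post = p_prior / (1 - p_prior) * LR_x
p_post = odds_post / (1 + odds_post)

# Equivalent logit form: logit(posterior) = logit(prior) + log(LR(x))
logit_post = logit(p_prior) + math.log(LR_x)

print(round(p_post, 3))   # → 0.308
```

A symptom with \(LR = 4\) raises the probability of disease from 10% to about 31%, exactly as the additive logit form predicts.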
Prediction with Naïve Bayes
- prediction for a combination of symptoms:
the posterior after \(x_1\) is the prior for \(x_2\), the posterior after \(x_2\) is the prior for \(x_3\), etc.
- valid only for conditionally independent predictors!
- might give very good discrimination!
- applied for effects of genetic markers
- simple correction for correlated predictors:
add a calibration slope to the model
\[
\mathrm{logit}(p(y = 1)) = \alpha + \beta_{\mathrm{cal}} \cdot \mathrm{lp}_u
\]
(\(\mathrm{lp}_u\): the uncalibrated linear predictor)
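A minimal sketch of both ideas, with assumed prior, likelihood ratios, and recalibration coefficients: the sequential naïve Bayes update adds each \(\log LR\) to the logit, and a calibration slope \(\beta_{cal} < 1\) then shrinks the resulting uncalibrated linear predictor.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(z):
    return 1 / (1 + math.exp(-z))

p_prior = 0.10                 # prior probability of disease (assumed)
LRs = [3.0, 2.0, 0.5]          # likelihood ratios of symptoms x1..x3 (assumed)

# Sequential updating: the posterior after x1 is the prior for x2, etc.
# On the logit scale this just adds log(LR) per symptom.
lp_u = logit(p_prior)
for lr in LRs:
    lp_u += math.log(lr)       # uncalibrated naive Bayes linear predictor

# Correlated predictors make lp_u too extreme; a calibration slope
# beta_cal < 1 shrinks it (alpha, beta_cal are assumed values here).
alpha, beta_cal = 0.0, 0.8
p_cal = inv_logit(alpha + beta_cal * lp_u)
print(round(inv_logit(lp_u), 3), round(p_cal, 3))
```

In practice \(\alpha\) and \(\beta_{cal}\) would be estimated by regressing the observed outcomes on \(\mathrm{lp}_u\).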
Neural networks
- generalized additive models: GAM
- generalized nonlinear models: neural networks (NN)
- input layer - hidden layer(s) - output layer
- iterative learning
- penalization (e.g. weight decay) to avoid "overtraining" (overfitting)
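A minimal sketch of the input - hidden - output structure: a forward pass through one hidden layer with sigmoid units. The weights here are illustrative, not trained; in practice they would be estimated iteratively (e.g. by backpropagation) with a penalty term.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# One hidden layer with 2 units; weights are illustrative, not trained.
W1 = [[0.5, -0.3], [0.8, 0.1]]   # input -> hidden weights (one row per hidden unit)
b1 = [0.0, 0.1]
W2 = [1.2, -0.7]                 # hidden -> output weights
b2 = -0.2

def predict(x):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)

p = predict([1.0, 2.0])
print(round(p, 3))
```

With a single hidden layer and sigmoid output, the network reduces to a logistic model on learned nonlinear features of the inputs.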
Tree models
- classification and regression tree (CART) aka recursive partitioning
- splitting of patients based on cut-off:
- maximum separation between subgroups
- minimum variability within subgroups
- many trees: random forest
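A minimal sketch of a single CART split, with assumed toy data: scan the candidate cut-offs on a predictor and keep the one that minimizes the size-weighted within-subgroup impurity (Gini), i.e. maximizes separation between the two subgroups.

```python
def gini(labels):
    """Gini impurity of a binary label list (0 for a pure subgroup)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(x, y):
    """Find the cut-off on x minimizing within-subgroup impurity."""
    best_cut, best_imp = None, float("inf")
    for cut in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= cut]
        right = [yi for xi, yi in zip(x, y) if xi > cut]
        if not left or not right:
            continue
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if imp < best_imp:
            best_cut, best_imp = cut, imp
    return best_cut, best_imp

# Toy data (assumed) with a clean separation at x = 2
x = [0, 1, 2, 3, 4, 5]
y = [0, 0, 0, 1, 1, 1]
cut, imp = best_split(x, y)
print(cut, imp)   # → 2 0.0
```

A tree applies this search recursively within each subgroup; a random forest repeats it over many bootstrap samples and averages the resulting trees.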
Tree models
advantages and disadvantages
advantages:
- simple presentation
- interaction effects incorporated
disadvantages:
- all continuous variables are categorized
- cut-offs lead to overfitting
- interactions are forced between all predictors
- some predictors appear only in certain branches
- often poor prediction performance
Other methods
- multivariate adaptive regression splines (MARS)
- support vector machine (SVM)
- regression and classification